3 research outputs found
Reimagining Speech: A Scoping Review of Deep Learning-Powered Voice Conversion
Research on deep learning-powered voice conversion (VC) in speech-to-speech
scenarios is becoming increasingly popular. Although many of the works in the
field of voice conversion share a common global pipeline, there is a
considerable diversity in the underlying structures, methods, and neural
sub-blocks used across research efforts. Thus, it can be challenging to obtain
a comprehensive understanding of why particular methods are chosen at each
stage of the voice conversion pipeline, and the actual hurdles in the proposed
solutions are often unclear. To shed light on these aspects, this
paper presents a scoping review that explores the use of deep learning in
speech analysis, synthesis, and disentangled speech representation learning
within modern voice conversion systems. We screened 621 publications from more
than 38 different venues between the years 2017 and 2023, followed by an
in-depth review of a final database consisting of 123 eligible studies. Based
on the review, we summarise the most frequently used approaches to voice
conversion based on deep learning and highlight common pitfalls within the
community. Lastly, we condense the knowledge gathered, identify the main
challenges, and provide recommendations for future research directions.
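
To make the "common global pipeline" mentioned above concrete, the sketch below
shows one typical arrangement in PyTorch: a content encoder and a speaker
encoder analyse and disentangle the utterances, and a decoder re-synthesises a
mel spectrogram from the combined representations. This is a minimal
illustration only; the module names, layer choices, and dimensions are
assumptions and do not correspond to any specific system covered by the review.

    import torch
    import torch.nn as nn

    class ContentEncoder(nn.Module):
        """Analysis stage: extracts a (nominally) speaker-independent content sequence."""
        def __init__(self, n_mels=80, dim=256):
            super().__init__()
            self.rnn = nn.GRU(n_mels, dim, batch_first=True)
        def forward(self, mel):                      # mel: (batch, frames, n_mels)
            return self.rnn(mel)[0]                  # (batch, frames, dim)

    class SpeakerEncoder(nn.Module):
        """Disentanglement stage: summarises speaker identity as a single embedding."""
        def __init__(self, n_mels=80, dim=128):
            super().__init__()
            self.rnn = nn.GRU(n_mels, dim, batch_first=True)
        def forward(self, mel):
            return self.rnn(mel)[0].mean(dim=1)      # (batch, dim), time-averaged

    class Decoder(nn.Module):
        """Synthesis stage: combines content and speaker codes into a mel spectrogram."""
        def __init__(self, content_dim=256, spk_dim=128, n_mels=80):
            super().__init__()
            self.rnn = nn.GRU(content_dim + spk_dim, n_mels, batch_first=True)
        def forward(self, content, spk):
            spk = spk.unsqueeze(1).expand(-1, content.shape[1], -1)
            return self.rnn(torch.cat([content, spk], dim=-1))[0]

    # Conversion: content from the source utterance, identity from the target speaker.
    src_mel = torch.randn(1, 120, 80)                # source speaker, 120 frames
    tgt_mel = torch.randn(1, 200, 80)                # target speaker, 200 frames
    content = ContentEncoder()(src_mel)
    speaker = SpeakerEncoder()(tgt_mel)
    converted = Decoder()(content, speaker)          # (1, 120, 80) mel spectrogram
    # A neural vocoder would then render the converted mel spectrogram to a waveform.

The time-averaged speaker embedding is the simplest possible disentanglement
bottleneck; actual systems typically add information constraints, adversarial
objectives, or pretrained speaker-verification embeddings.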
Differentiable all-pass filters for phase response estimation and automatic signal alignment
Virtual analog (VA) audio effects are increasingly based on neural networks and
deep learning frameworks. Due to the underlying black-box methodology, a
successful model will learn to approximate the data it is presented with,
including potential errors such as latency and audio dropouts as well as
non-linear characteristics and frequency-dependent phase shifts produced by the
hardware. The latter is of particular interest as the learned phase response
might cause unwanted audible artifacts when the effect is used for creative
processing techniques such as dry-wet mixing or parallel compression. To
overcome these artifacts, we propose differentiable signal processing tools and
deep optimization structures for automatically tuning all-pass filters to
predict the phase response of different VA simulations, and align processed
signals that are out of phase. The approaches are assessed using objective
metrics while listening tests evaluate their ability to enhance the quality of
parallel path processing techniques. Ultimately, an over-parameterized,
BiasNet-based, all-pass model is proposed for the optimization problem under
consideration, resulting in models that can estimate all-pass filter
coefficients to align a dry signal with its affected, wet, equivalent.
Differentiable Allpass Filters for Phase Response Estimation and Automatic Signal Alignment
Virtual analog (VA) audio effects are increasingly based on neural networks
and deep learning frameworks. Due to the underlying black-box methodology, a
successful model will learn to approximate the data it is presented with,
including
potential errors such as latency and audio dropouts as well as non-linear
characteristics and frequency-dependent phase shifts produced by the hardware.
The latter is of particular interest as the learned phase response might cause
unwanted audible artifacts when the effect is used for creative processing
techniques such as dry-wet mixing or parallel compression. To overcome these
artifacts, we propose differentiable signal processing tools and deep
optimization structures for automatically tuning all-pass filters to predict
the phase response of different VA simulations, and align processed signals
that are out of phase. The approaches are assessed using objective metrics
while listening tests evaluate their ability to enhance the quality of parallel
path processing techniques. Ultimately, an over-parameterized, BiasNet-based,
all-pass model is proposed for the optimization problem under consideration,
resulting in models that can estimate all-pass filter coefficients to align a
dry signal with its affected, wet, equivalent.
Comment: Collaboration done while interning/employed at Native Instruments.
Accepted for publication in Proc. DAFX'23, Copenhagen, Denmark, September
2023. Sound examples at https://abargum.github.io v2: 10 pages, LaTeX;
figures resized, pdf optimized
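
The BiasNet-based model described in the abstract is not reproduced here, but
the sketch below illustrates the core idea of a differentiable all-pass filter:
a first-order all-pass section is applied in the frequency domain so that its
single coefficient remains differentiable, and gradient descent tunes that
coefficient until the filtered dry signal matches an out-of-phase wet signal.
The function names and the toy target coefficient are assumptions made for this
illustration, not details taken from the paper.

    import torch

    def allpass_response(a, n_samples):
        # First-order all-pass H(e^jw) = (a + e^-jw) / (1 + a*e^-jw):
        # unit magnitude at every frequency, phase determined by the coefficient a.
        w = torch.linspace(0, torch.pi, n_samples // 2 + 1)
        z_inv = torch.exp(-1j * w)
        return (a + z_inv) / (1 + a * z_inv)

    def apply_allpass(x, a):
        # Frequency-domain application keeps the coefficient differentiable;
        # it ignores transient/circular effects, which is acceptable for a sketch.
        X = torch.fft.rfft(x)
        return torch.fft.irfft(X * allpass_response(a, x.shape[-1]), n=x.shape[-1])

    # Toy data: the "wet" path is the dry signal passed through an unknown all-pass.
    torch.manual_seed(0)
    dry = torch.randn(4096)
    wet = apply_allpass(dry, torch.tensor(0.6))

    # Learnable coefficient, tuned so the filtered dry signal matches the wet one.
    a = torch.tensor(0.0, requires_grad=True)
    opt = torch.optim.Adam([a], lr=0.05)
    for step in range(200):
        opt.zero_grad()
        loss = torch.mean((apply_allpass(dry, a) - wet) ** 2)
        loss.backward()
        opt.step()

    print(f"estimated coefficient: {a.item():.3f} (target 0.6)")

Matching the phase response of a real VA simulation requires more than a single
first-order section; the sketch only demonstrates that all-pass coefficients
can be recovered by gradient-based optimization.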